1. Data Set Preparation

	1.1 Create the folder /root/TrainingOnHDP/dataset/spark in your sandbox

	1.2 Upload all data files (including subfolders) into /root/TrainingOnHDP/dataset/spark in your sandbox

	1.3 Log in to the Sandbox and run the following commands:

		hadoop fs -mkdir /user/root/works
		hadoop fs -chmod -R 777 /user/root/works
		hadoop fs -mkdir /root
		hadoop fs -mkdir /root/labs
		hadoop fs -mkdir /root/labs/datasets
		hadoop fs -put /root/TrainingOnHDP/dataset /root/labs
		
		hadoop fs -put /root/TrainingOnHDP/dataset/people.txt /root/labs
		
		hadoop fs -chmod -R 777 /root/labs/datasets
		
2. Spark shell
	2.1 Launch Spark Shell via SSH:
		spark-shell
		
	2.2 In your browser, open the Spark web UI at localhost:4040 and explore the Jobs, Stages, Storage, Executors and SQL tabs
	
	2.3 Type sc. (sc followed by a dot) and press TAB to list all properties and operations of the SparkContext
	
	2.4 Press Enter three times to return to the scala> prompt
	
	2.5 Type sqlContext. (sqlContext followed by a dot) and press TAB to list all properties and operations of the SQLContext
	
	2.6 Type in the following command to see the name of the Spark shell application, which is in fact the Spark driver program:
		sc.appName

	2.7 Type in the following command to dump the current shell configuration properties:
		sc.getConf.toDebugString
	
	2.8 Type in the following command to load a text file:
		val rdd = sc.textFile("file:///root/TrainingOnHDP/dataset/spark/people.txt")
	
	2.9 Type in the following command to inspect the RDD's lineage (toDebugString is not an action; nothing is computed yet):
		rdd.toDebugString
		
	2.10 Enter the following command to count the lines (count is an action, so it triggers a Spark job):
		rdd.count()
		
	2.11 Enter the following to print out all properties and operations of MapPartitionsRDD:
		rdd. and press TAB
		Press Enter three times to get back to the prompt
		
	2.12 Type in the following command to create an array of tab-separated values:
		val seq = Array("Peter	Developer	21", "James	Manager	35", "Mary	QA	27")
		
	2.13 Enter the following command to make RDD from Collection:
		val rddFromArray = sc.makeRDD(seq)
	
	2.14 Enter the following to print out all properties and operations of ParallelCollectionRDD:
		rddFromArray. and press TAB
		Press Enter three times to get back to the prompt
		
	2.15 Type in the following command to save RDD into HDFS as text file:
		rddFromArray.saveAsTextFile("/user/root/works/rddoutput")
	
	2.16 In your browser, open the Spark web UI at localhost:4040 and explore the completed Jobs and Stages
	
	2.17 Go to the HDFS file browser to verify that the output files were created there

	2.18 Check spark version
		
		spark.version
		
	2.19 List imports	
		
		:imports
		
	2.20 Spark shell creates an instance of SparkSession under the name spark for you
		
		:type spark

		Spark shell creates an instance of SparkContext under the name sc for you
		
		:type sc
		
	
3. Parallelized Collections
	3.1 Example:
		val data = Array(1, 2, 3, 4, 5)
		val distData = sc.parallelize(data, 3)
		val result = distData.reduce((a, b) => a + b)
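	The reduce above sums the elements across the 3 partitions. A plain-Scala sketch of what the job computes (no Spark needed; the partition split shown is an illustration, Spark decides the actual boundaries):

	```scala
	// Plain-Scala sketch of sc.parallelize(data, 3) followed by reduce.
	val data = Array(1, 2, 3, 4, 5)

	// Simulate 3 partitions, reduce within each, then combine the partial
	// results, mirroring how Spark reduces per-partition and then across them.
	val partitions = data.grouped((data.length + 2) / 3).toArray
	val partials = partitions.map(_.reduce((a, b) => a + b))
	val result = partials.reduce((a, b) => a + b)
	// result is 15, the same as distData.reduce((a, b) => a + b)
	```

	Because (a, b) => a + b is associative and commutative, the per-partition split does not change the final value.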
	
4. Working with Key-Value Pairs
	4.1 Example:
		val lines = sc.textFile("/root/labs/people2.txt")
		val pairs = lines.map(s => (s, 1))              // key each whole line with a count of 1
		val counts = pairs.reduceByKey((a, b) => a + b) // sum the counts per distinct line
		counts.toDebugString                            // inspect the lineage
		counts.collect().foreach(println)               // action: bring all results to the driver
		counts.take(1).foreach(println)                 // action: fetch just the first result
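	Note that the key here is the whole line, so the pipeline counts duplicate lines. A plain-Scala sketch of what reduceByKey computes (the sample lines are made up; the real people2.txt contents may differ):

	```scala
	// Plain-Scala equivalent of the pair-RDD pipeline above (no Spark required).
	// Sample input lines, invented for illustration only.
	val lines = Seq("Peter Developer 21", "James Manager 35", "Peter Developer 21")

	val pairs = lines.map(s => (s, 1))
	// reduceByKey groups pairs by key and merges the values with the given function
	val counts = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).reduce(_ + _)) }

	counts.foreach(println)
	// the duplicated line gets a count of 2, the other line a count of 1
	```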

5. Creating a Pair RDD with Map
	5.1 Example:
		val clientRDD = sc.textFile("/root/labs/datasets/labs/people.txt")
		val pairRDD = clientRDD.map(_.split(',')).map(f => (f(0), f(1)))
		pairRDD.collect().foreach(println)
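	The two map steps above first split each line into fields, then keep the first two fields as a (key, value) pair. A plain-Scala sketch, using made-up comma-separated lines (the real people.txt layout may differ):

	```scala
	// Plain-Scala sketch of the pair-creation step above (no Spark required).
	// Sample CSV lines, invented for illustration only.
	val clientLines = Seq("Peter,Developer", "James,Manager")

	// Split each line on ',' and keep fields 0 and 1 as a (key, value) pair,
	// mirroring clientRDD.map(_.split(',')).map(f => (f(0), f(1)))
	val pairs = clientLines.map(_.split(',')).map(f => (f(0), f(1)))
	pairs.foreach(println)
	```

	If a line has fewer than two fields, f(1) throws ArrayIndexOutOfBoundsException, so the split character must match the file's real delimiter.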

6. RDD Persistence
	6.1 Example:
		pairRDD.cache()
		pairRDD.getStorageLevel
		pairRDD.getStorageLevel.useMemory
		import org.apache.spark.storage.StorageLevel._
		val pairRDD2 = clientRDD.map(_.split(',')).map(f=>(f(0),f(1)))
		pairRDD2.collect().foreach(println)
		pairRDD2.persist(MEMORY_AND_DISK)
		pairRDD2.getStorageLevel
		sc.getRDDStorageInfo
		sc.getPersistentRDDs // list the RDDs currently marked as persistent
		pairRDD2.unpersist()
		
7. Broadcast Variables
	7.1 Example:
		val arr1 = (0 until 100).toArray
		for (i <- 0 until 3) {
			println("Iteration " + i)
			println("===========")
			val startTime = System.nanoTime
			val barr1 = sc.broadcast(arr1)
			val observedSizes = sc.parallelize(1 to 10, 5).map(_ => barr1.value.size)
			observedSizes.collect().foreach(i => println(i))
			println("Iteration %d took %.0f milliseconds".format(i, (System.nanoTime - startTime) / 1E6))
		}
	
	7.2 Example:	
		val arr2 = (0 until 1000).toArray
		val barr1 = sc.broadcast(arr1)
		val barr2 = sc.broadcast(arr2)
		val observedSizes1 = sc.parallelize(1 to 10, 5).map { _ => (barr1.value.size, barr2.value.size)}
		observedSizes1.collect().foreach(i => println(i))	
	

8. Accumulators
	8.1 Example:
		val accum = sc.accumulator(0, "My Accumulator") // deprecated in Spark 2.x: use sc.longAccumulator
		sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
		accum.value
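	The accumulator collects per-task contributions into one shared counter that only the driver reads. A plain-Scala sketch of the value it ends up with (in real Spark a plain var would not work, because each executor would update its own copy):

	```scala
	// Plain-Scala sketch of what the accumulator computes (no Spark required):
	// every element is added to a shared counter, and the final value is read
	// once all updates are done, mirroring accum.value on the driver.
	var accum = 0
	Array(1, 2, 3, 4).foreach(x => accum += x)
	// accum is 10, matching accum.value in the Spark example
	```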

9. Single Local File System RDD Partitioning
	9.1 Example:
		val rdd3 = sc.textFile("file:///root/TrainingOnHDP/dataset/spark/people.txt", 4)
		rdd3.getNumPartitions
	
	
